Project Overview PR2.20

The goal of the project is to apply clustering to meteorological data collected for the provinces of Emilia-Romagna.

The aim of the analysis is to discover which provinces are most similar in terms of polluting chemicals.

The data consists of several CSV files, each representing a province.

The main challenge was applying clustering to time series data, which calls for more advanced methodologies than classical clustering algorithms.

Setup

In this section we show the import procedure and the cleaning steps performed on the dataset.

Each dataset is imported into a dictionary whose keys are the province names.

Import

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

script_path = os.getcwd()
path = os.path.dirname(script_path)

def import_data(path):
    data = R'data'
    data_path = os.path.join(path, data)
    
    # List all files in the folder
    csv_files = [file for file in os.listdir(data_path) if file.endswith(".csv") and file != 'InfoComune.csv']

    # Create an empty dictionary to store DataFrames
    dataframes = {}

    # Iterate through each CSV file
    for file in csv_files:
        # Extract the file name
        df_name = os.path.splitext(file)[0]
        
        # Create the DataFrame and store it in the dictionary
        dataframes[df_name] = pd.read_csv(os.path.join(data_path, file), header=0, skiprows = [1])
        
    return dataframes


dataframes = import_data(path)

# store dictionary items in specific variables to make it easier to loop through them
datasets = dataframes.values()
provinces = dataframes.keys()
data_prov_pairs = dataframes.items()

The following variables keep track of the columns we wish to retain and their positions in the datasets.

# columns for which to keep the average value only
pollutants = ['CO', 'NH3', 'NMVOC', 'NO2', 'NO', 'O3', 'PANS', 'PM10', 'PM2.5', 'SO2']

# meteorological information
met = ['TG', 'TN', 'TX', 'HU', 'PP', 'QQ', 'RR']
met_pos = range(6, 13)
# date values
date = ['YYYY', 'MM', 'DD']
date_pos = list(range(3))

# full set of columns to keep and the numeric subset, used by the cleaning
# and smoothing steps below (assumes the pollutant columns carry these names)
selected_columns = date + met + pollutants
numerics = met + pollutants

Cleaning and filtering

def clean_data(data):
    for province, df in data.items():
        # change missing values to the proper format
        df.replace('---', np.nan, inplace = True)
        # ensure a unique format
        df = df.convert_dtypes()

        # rename the columns for date and metereological information
        old_date = df.columns[date_pos]
        old_met = df.columns[met_pos]
        
        df.rename(columns=dict(zip(old_date, date)), inplace=True)
        df.rename(columns=dict(zip(old_met, met)), inplace=True)
                
        df = df[selected_columns]

        # Combine 'YYYY', 'MM', 'DD' columns into a new 'date' column
        df['date'] = pd.to_datetime(df[['YYYY', 'MM', 'DD']].astype(str).agg('-'.join, axis=1), format='%Y-%m-%d')
        
        # Remove 'YYYY', 'MM', 'DD' columns
        df.drop(['YYYY', 'MM', 'DD'], axis=1, inplace=True)
                
        # Reorder columns with 'date' as the first column
        df = df[['date'] + [col for col in df.columns if col != 'date']]
        
        # first convert to numeric the columns in met and pollutants, since they are strings
        df[numerics] = df[numerics].apply(pd.to_numeric, errors = 'coerce')
        # round to the second decimal number for better visualization
        df[numerics] = df[numerics].round(2)
        
        # we want to filter the series so that we don't have missing values
        # We'll start from 2018-01-01 and move until 2020-12-28

        df = df[(df['date'] >= pd.to_datetime('2018-01-01')) & (df['date'] <= pd.to_datetime('2020-12-28'))]
        
        data[province] = df

    return data

dataframes = clean_data(dataframes)

Missing values

dataframes['bologna'][dataframes['bologna'].isnull().any(axis=1)]

We find that 2020-02-29 is still missing for some variables.

Replacing missing values

Since the remaining missing values do not occur at the beginning or the end of the series, we imputed them with the median of each column, which is robust to extreme values.
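The robustness argument can be seen on a toy vector (illustrative values, not project data):

```python
import numpy as np

# a small sample with one extreme observation
vals = np.array([1.0, 2.0, 3.0, 100.0])

print(np.mean(vals))    # 26.5, pulled toward the outlier
print(np.median(vals))  # 2.5, barely affected by it
```

A single extreme value shifts the mean by an arbitrary amount, while the median moves only between the middle-ranked observations.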

def impute_median(data):
    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy='median')

    for province, df in data.items():
        df[numerics] = imputer.fit_transform(df[numerics])

        data[province] = df

    return data

dataframes = impute_median(dataframes)

Smoothing

We applied smoothing to the series to reduce the influence of extreme observations, both on clustering and on visualizations.

The Savitzky-Golay filter applies polynomial smoothing over the time span indicated by the window parameter.

Its advantage over other smoothing methods, such as the moving average, is that it does not introduce missing values at the edges of the series.
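The edge behaviour can be checked on a synthetic series (toy data; the window of 7 used here is arbitrary):

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
series = pd.Series(np.sin(np.linspace(0, 6, 50)) + rng.normal(0, 0.2, 50))

# a rolling mean leaves window - 1 missing values at the start of the series
rolling = series.rolling(window=7).mean()
print(rolling.isna().sum())      # 6

# the Savitzky-Golay filter returns a full-length series with no NaNs
smoothed = savgol_filter(series.to_numpy(), window_length=7, polyorder=2)
print(np.isnan(smoothed).sum())  # 0
```

This is why the filter could be applied in place, column by column, without re-trimming the series afterwards.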

For the variable PP, which we want to plot, we applied different and stricter parameters because it showed more extreme values than the other variables.

from scipy.signal import savgol_filter

def smooth(window, window_pp, poly_pp, poly = 2):
        
    for province, df in data_prov_pairs:
        
        for column in df[numerics].columns:
            # Extract numerical values from the DataFrame
            values = df[column].values
            
            # Apply Savitzky-Golay filter to the numerical values
            # for PP we need stronger smoothing due to outliers, so use its
            # dedicated parameters without overwriting the defaults for the
            # remaining columns
            if column == 'PP':
                w, p = window_pp, poly_pp
            else:
                w, p = window, poly
                
            smoothed_values = savgol_filter(values, w, p)
            # Update the DataFrame with the smoothed values
            df[column] = smoothed_values
            
        dataframes[province] = df

smooth(window = 10, window_pp = 25, poly_pp = 5)

Concatenating

Concatenation was required to produce plots and analyses by group.

During concatenation, a group variable is appended to each series to indicate membership.

The generality of the function allows it to be used both for analysis by province and by cluster.

def concat_group(groups, group_name: str):
    # Create an empty list to store modified dataframes
    dfs = []

    # Iterate through the dictionary values and groups labels
    for df, group in zip(datasets, groups):
        
        # Add a 'group' column
        df[group_name] = group
        
        # Append the modified dataframe to the list
        dfs.append(df)
    
    # Concatenate all dataframes in the list along the rows
    full_df = pd.concat(dfs, ignore_index=True)
    
    return full_df

Visualization

Here we show visualizations of the smoothed series.

The function relies on the output of concatenation to produce plots by group.

From the plots we see that the meteorological variables (TG, TN, TX, HU, PP, QQ, RR) are highly similar across provinces, so we excluded them from clustering, given that the problem is already high-dimensional.

def plot_group(subfolder: str, vars, group: str, full_df):
    # subfolder should be a string indicating the name of the subfolder in the figures_folder
    # vars should be the list containing the variable names, therefore either pollutants, mets, numerics
    # group should be the variable we want to group by (eg: province, clusters,...)
    # to group we need to have created the full dataframe before with the specific group variable
    
    sns.set(style="whitegrid")
    figures = R'figures'
    figures_folder = os.path.join(path, figures)
    output_folder = os.path.join(figures_folder, subfolder)
    os.makedirs(output_folder, exist_ok=True)

    for var in vars:
        
        # Plot the time series for each province
        plt.figure(figsize=(12, 6))
        sns.lineplot(x='date', y=var, hue=full_df[group], data=full_df)

        plt.title(f'{var} Over Time by {group}')
        plt.xlabel('Date')
        plt.ylabel(var)
        plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
        
        save_path = os.path.join(output_folder, f'{var}.png')
        plt.savefig(save_path, bbox_inches='tight')
        
        plt.show()
full_df = concat_group(provinces, 'province')
plot_group('smooth_series', numerics, 'province', full_df)

CO.png HU.png NH3.png NMVOC.png NO.png NO2.png O3.png PANS.png PM2.5.png PM10.png SO2.png PP.png QQ.png RR.png TG.png TN.png TX.png

Time Series K-Means

In this section we implement a variant of KMeans suited to clustering time series data (Time Series KMeans).

The main difference from the traditional KMeans algorithm is the distance metric: in time series, the Euclidean distance can introduce distortions and fail to represent similarity correctly, for example when two series share the same shape but are shifted in time.

The most common distance metric used for time series is Dynamic Time Warping (DTW).
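To make the intuition concrete, here is a minimal pure-NumPy sketch of the DTW dynamic program (the analysis itself relies on tslearn's implementation; this toy version uses absolute differences as the local cost and no warping-window constraint):

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal DTW: cumulative cost of the best monotonic alignment."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# two series with the same shape, one shifted in phase
t = np.linspace(0, 2 * np.pi, 50)
a, b = np.sin(t), np.sin(t + 0.5)

lockstep = np.abs(a - b).sum()  # point-by-point (Euclidean-style) comparison
warped = dtw_distance(a, b)     # alignment absorbs most of the shift
```

Because DTW may match each point to its best-aligned neighbour, `warped` comes out far smaller than the lockstep cost, which is exactly why it suits series that share a pattern but are shifted in time.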

# import the required packages
from tslearn.clustering import TimeSeriesKMeans, silhouette_score
from sklearn.preprocessing import MinMaxScaler

Data preparation for clustering

First, we normalized the data: a necessary step before applying any distance-based clustering algorithm, so that variables on larger scales do not dominate the distances.
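Min-max scaling maps each column to [0, 1] via (x - min) / (max - min), which is what MinMaxScaler computes per column with its default range; a quick check on toy values:

```python
import numpy as np

# one column of toy values (not project data)
x = np.array([[2.0], [4.0], [10.0]])
scaled = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(scaled.ravel())  # 0.0, 0.25, 1.0
```

The minimum and maximum of each column map exactly to 0 and 1, so all pollutant series end up on a comparable scale.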

for df in datasets:
    df[pollutants] = MinMaxScaler().fit_transform(df[pollutants])

Data structure

The package used (tslearn) expects a 3D array of shape (n_series, n_timesteps, n_features) as input, where each slice along the first axis corresponds to a province.

# Prepare input data
X = []

for df in datasets:
    group_data = df[pollutants].values
    
    group_data = np.expand_dims(group_data, axis=0)
    X.append(group_data)

# Stack the list of arrays to create a 3D array
X = np.vstack(X)

Optimal number of clusters

The first step of KMeans is determining the optimal number of clusters (k).

This is done by running the algorithm for several choices of k; for each result we extracted two common evaluation metrics:

  • Inertia (sum of within-cluster distances)
  • Silhouette score

inertia = []
silhouette = []
K = list(range(2, 9))

for k in K:
    km = TimeSeriesKMeans(n_clusters=k, n_init=5, metric='dtw', random_state=0)
    
    labels = km.fit_predict(X)
    
    silhouette_avg = silhouette_score(X, labels, metric='dtw')
    
    inertia.append(km.inertia_)
    silhouette.append(silhouette_avg)

Optimal number of clusters

In the inertia plot it is difficult to identify a value of k at which the rate of decrease changes sharply. In contrast, the silhouette score is clearly highest for 4 clusters.

# Plotting inertia vs silhouette
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(K, inertia, marker='o')
plt.title('Inertia vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')

plt.subplot(1, 2, 2)
plt.plot(K, silhouette, marker='o')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')

plt.tight_layout()
figures_folder = os.path.join(path, R'figures')
plt.savefig(os.path.join(figures_folder, 'inertia_silhouette.png'))
plt.show()

Optimal number of clusters

inertia_silhouette.png

KMeans with 4 clusters

km4 = TimeSeriesKMeans(n_clusters=4, n_init=5, metric='dtw', random_state=0)
clusters = km4.fit_predict(X)

def table_clusters(provinces = provinces, clusters = clusters):
    # Specify the results folder
    results = R"results"
    results_path = os.path.join(path, results)
    
    # Create a DataFrame
    cluster_df = pd.DataFrame(list(zip(provinces, clusters)), columns=['Province', 'Cluster'])
    
    # Create a figure and axis
    fig, ax = plt.subplots(figsize=(5, 5))

    # Hide the axes
    ax.axis('off')

    # Create a table and add it to the axis
    table = ax.table(cellText=cluster_df.values, colLabels=cluster_df.columns,
                    cellLoc='center', loc='center', edges = 'open')

    # Make column labels bold
    for (i, j), cell in table.get_celld().items():
        if i == 0:
            cell.set_text_props(fontweight='bold')

    # Adjust column width
    table.auto_set_column_width([0, 1])

    # Adjust the position of the table within the figure
    table.set_fontsize(13)  # Adjust font size
    table.scale(1, 2)  # Scale the table

    save_file = os.path.join(results_path, 'cluster_provinces.png')
    # Save the figure as a PNG image in the specified folder
    fig.savefig(save_file, bbox_inches='tight', pad_inches=0.5, dpi=300)

table_clusters()

Provinces and clusters

cluster_provinces.png

Summary statistics of clusters

Here we show tables containing statistics for the various clusters.

df_cluster = concat_group(clusters, 'cluster')

def summary(full_df):
    from pandas.plotting import table
    
    grouped_df = full_df[['cluster'] + pollutants].groupby('cluster')
    summary_stats = grouped_df.describe()

    # build the output path portably (avoids the Windows-only backslash separator)
    stats_path = os.path.join(path, 'results', 'statistics')
    os.makedirs(stats_path, exist_ok=True)
    
    for variable in pollutants:
        selected_stats = summary_stats[variable][['count', 'mean']]
        selected_stats = selected_stats.reset_index()
        
        # overwrite the 'cluster' column (restored by reset_index) with the labels 0, 1, 2, 3
        selected_stats['cluster'] = [0, 1, 2, 3]
        # round the mean to at most 4 decimal places for display
        selected_stats['mean'] = selected_stats['mean'].round(4)
        # ensure integers do not show decimal points by converting them to strings
        selected_stats['count'] = selected_stats['count'].astype(int).astype(str)
        selected_stats['cluster'] = selected_stats['cluster'].astype(str)
        
        # Move 'cluster' column to the first position
        selected_stats = selected_stats[['cluster', 'count', 'mean']]

        # Plotting parameters
        fig, ax = plt.subplots(figsize=(6, 4))
        ax.axis('off')  # Turn off axis

        cell_data = [selected_stats.columns] + selected_stats.values.tolist()
        table = ax.table(cellText=cell_data, loc='center', cellLoc='center', colLabels=None, edges='open')

        # Adjust column width
        table.auto_set_column_width([0, 1])

        # Adjust the position of the table within the figure
        table.set_fontsize(15)  # Adjust font size
        table.scale(1, 1.5)  # Scale the table
        
        # Add a title
        ax.set_title(f"Summary Statistics for {variable}", fontsize=18, y=0.9)

        # Make column labels bold
        for (i, j), cell in table.get_celld().items():
            if i == 0:
                cell.set_text_props(fontweight='bold')
    
        # Save the plot as an image
        save_name = os.path.join(stats_path, f'{variable}_table.png')
        plt.savefig(save_name, bbox_inches='tight', pad_inches=0.2)
        plt.show()

summary(df_cluster)

CO_table.png NH3_table.png NMVOC_table.png NO_table.png NO2_table.png O3_table.png PANS_table.png PM10_table.png PM2.5_table.png SO2_table.png

Conclusions

  • Time Series KMeans with Dynamic Time Warping suggested 4 clusters among the provinces, but the results are difficult to interpret
  • Many variables have similar statistics across clusters, which means the clusters themselves are not markedly different
  • Some clusters do, however, differ significantly on specific variables:
    • NH3: clusters 0 and 1
    • NMVOC: clusters 0 and 3
    • O3: clusters 2 and 3
    • PM2.5: cluster 3 vs clusters 0 and 2
  • Graphs by province also show that the provinces have generally similar levels of the polluting chemicals